Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create initial library version #1

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

M-Fedor
Copy link
Collaborator

@M-Fedor M-Fedor commented Apr 22, 2021

No description provided.

@M-Fedor M-Fedor self-assigned this Apr 22, 2021

elif platform == 'win32':

class DirectoryWatcherWindows(InputReaderInterface):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a prototype now. Will be tested properly on Windows platform.


read_id: str
filename: str
caller: BasecallerInterface
Copy link
Collaborator Author

@M-Fedor M-Fedor Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a design flaw. It works fine with BasecallerMock, but can harm the performance once we have large class objects to be serailized. I'll rework this.

return self.task_batch


if platform == 'linux':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given how big are these, I would split them into platform/xyz files so that this file simply re-exports correct watcher for given platform

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping that someone gives reaction to that. Thanks. When located in single module I had a problem that platform specific code could not be interpreted properly. Therefor those if statements. But your solution solves the problem in more elegant way.

OutputFormatterFastq, SequentialTaskExecutor, ParallelTaskExecutor)


class ArgumentParser(ArgumentParserInterface):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reason for this to be class if it's only interaction is create it & then call parse_arguments once. I would make it just a function parse_arguments

Copy link
Collaborator Author

@M-Fedor M-Fedor Apr 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it also initializes LibraryComponents class. Instead of returning it all I figured, it would be more understandable, if I designed it as a class. Also, if someone else will want to use that library in a future and he will need to implement his own ArgumentParser, he can simply create a subclass of ArgumentParserInterface and generally just use the existing code infrastructure.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using against patterns like subclass. OOP, as done by many, is done wrong. In most cases, functional design fares way better in terms of maintainability, clarity and testability. Perhaps checou out some Sandi Metz presentations, they are usually good inspiration

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this design decision is just silly. But I have a question. I don't like about python libraries that internal code is just mixed with API methods and the only distinction between them is just an underscore at the beginning of a function name. I really liked here that all of the implementation is separated from the library interface and it's clear for a new-comer what the interface looks like and what is the intended usage without spending too much time on trying to understand what the f*** it does. Just look at that:
https://github.com/nanoporetech/ont_fast5_api
It's just tedious piece of code for me to use.

If I use a functional design for this component, is there any preferred pythonic way to declare but not implement functions a thus create pretty interface? Or should I just create a commented document describing the functionality in interface folder? Maybe it's not that necessary in this small component, but I would like to know for a future reference. Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also discuss this one @ppershing ?


def parse_arguments(self) -> Arguments:

self.parser.add_argument('--directory', type=str, nargs='*',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is weird that you add arguments to parser only in parse_args

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you mean.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now, parse_arguments is not idempotent despite it's name suggesting that it does not mutate anything. Try to run it twice and see the fireworks happen ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional design should fix this one.


class OutputFormatterFastq(OutputFormatterInterface):

def write_output(self, output_data: List[OutputData]) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it List or Sequence?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is List, could be a Sequence, but in my opinion should be a Tuple. It should be a tuple since returned by TaskExecutor. So it would be enforced by design that those data are immutable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the function does not care about whether it is list, tuple, or sequence, it should take the superset of them, i.e. the Sequence. Especially if you don't see the requirement changing in the future. In fact, you might want to pass output data generator to the function instead of passing a list

if len(basecalled_seq) == 0:
return

print(">%s" % read_id, file=self.output_stream)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what-the? How could an "interface" have self.output_stream? 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In python there's not really such strict concept interface. Here I preferred code reuse in baseclass over the interface nature. So OutputFormatterInterface defines public API and also implements som common init() method for all the subclasses. It could be discussed though, what are your ideas?

Copy link

@ppershing ppershing Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really such strict concept interface.

If you name something as interface, it better be interface

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

return self.library_components


def _initialize_input_reader(arguments: argparse.Namespace) -> InputReaderInterface:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's unclear why this needs to get whole argspace namespace when it can take bool.
Perpahs more importantly, watch_directory should not be bool but use "store" feature of argparse to store some value to it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But on the other look, at least the interface is consistent with the rest of the helper functions so maybe it is ok as it is


for output in output_data:

(read_id, run_id, read_num, channel_num, start_time, basecalled_seq, quality_scores) = output

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this begs for named tuple

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output_data is a named tuple. But I was thinking that putting it into print() function diretly from a named tupple like output.read_id would take even more space and be even more ugly. What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, for the starter, this does not check whether you do correct naming at destructuring. So it is easy to make a mistake.
Second, you can just rename output to o and then it would be more natural.
Finally, you can probably use f-strings instead of format interpolation (assuming python 3) though I guess that is up to discussion whether it will make things nicer or not

This method is always called by ArgumentParser and should NOT be
called by integrator.
"""
raise NotImplementedError

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not @AbstractMethod ?


class OutputFormatterInterface:

def __init__(self, output_name: str, compressed_output: bool) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exact OOP antipattern.
You should have OutputFormatter instance take "FileLike" argument and let it just do the job of writing to it. Or alternatively, supply "write_output" function to this class. The way it is now, you are mixing two responsibilities in one class hierarchy

else:
self.output_stream = open(output_name, "w")

def __del__(self) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

del does not work correctly, if you want RAII style, you have to use "with" statement and context managers

from .data_types import FilePath, OutputData

@dataclass
class BasecallerInterface:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can dataclass be an interface? That make absolutely zero sense

def __init__(self) -> None:
self.caller = None

def set_caller(self, caller: BasecallerInterface) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't it rather take caller as constructor param?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'm not sure this is the best design, but I couldn't do better. caller is an external dependency. This library should be able to utilize any basecaller as long as it implements the BasecallerInterface. While ArgumentParser provides integrator with created and initialized TaskExecutor object, it has no way to know what caller to supply. And I think it shouldn't have. It should be an integrator responsibilty in my opinion to supply a correct caller.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that tells you that your whole design is off and has cross-cutting concerns. For example, you can have execute_task_batch to also take caller as an argument. That way, TaskExecutor deals only with executing in some fashion and does not care what the caller is. You can then "bind" caller to task executor in another level of abstraction

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true. But once I have caller as parameter of execute_task_batch() I have to serialize the caller object in parallel execution to communicate the caller object to the workers. If it was a data member, function _call_read() could be made a class method in some TaskExecutorBase class other executors would inherit from and caller object could be accessed naturally. (My original plan with this). That way it would be beneficial for performance since serializing large caller objects can be costly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok, now I see this. But then I don't understand how this works if you set caller after constructing the pool -- once pool is created, I would assume workers are already spawned and it would be too late to assign caller in the parent process

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THAT is an excellent question. Haven't thought about it that way.

from typing import NamedTuple


class OutputData(NamedTuple):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that this has very nasty confusable arguments (string, string, int, string, ....)
I strongly vote against named tuple. It should be data class instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read a few articles and can't figure out what would be the difference. Can you elaborate, please?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Namedtuple can be destructured using tuple destructuring. Dataclass cannot. This prevents you from accidentally shooting yourself to foot as you always must access through named parameters

Arguments can be obtained from ArgumentParser object.
"""

watch_directory: bool

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confusing name. From the name itself I would expect this to contain string/path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants